First published: 2020-08-21. Last updated: 2020-08-21.
The Dynamo Paper is a fascinating look at Amazon's work on developing a distributed data store and has influenced many other distributed storage systems like Riak, Cassandra and Voldemort. After going through the paper, I have a better understand of the specific requirements and context that led Amazon to its design and the tradeoffs that were made. It's all about the tradeoffs 🤠!
The Dynamo paper has a ton of info in it but here are some of my takeaways:
- A core input to Dynamo's design is the emphasis that "all customers have a good experience, rather than just the majority", so the SLA is expressed and measured at the 99.9th percentile of the distribution rather than the median or the mean. This perspective seems to have driven several of the design decisions like single node routing ("zero-hop DHT") and choosing the coordinator for a write to be the node that replied fastest to the previous read operation.
- From CAP theorem standpoint, Dynamo has high availability and eventual consistency, with a particular focus on write availability. In the case of the e-commerce cart, Amazon was okay with Dynamo return old cart being returned, even a cart with previously deleted items, as long as users could submit updates to their cart and an "add to cart" operation is never lost. Reconciliation of branching data is handled by logic on the clients during a read->write and can be business logic specific or timestamp based. Peter Alvaro used an abstract of a set in an interesting way to characterize how the cart operations can be reconciled in this episode of Software Engineering Daily by noting that set operations are commutative.
- It is important that Dynamo allows for the values of N, R, W to be tuned. The common configuration for (N, R, W) is (3, 2, 2). According to Distributed systems for fun and profit, Riak also uses (N = 3, R = 2, W = 2) as a default while Cassandra use (N = 3, R = 1, W = 1) by default, which implies that Cassandra typically has higher availability (less hosts need to be alive to process requests) but has an increased risk of inconsistency.
- The papers mentions a couple times that Dynamo is designed for an environment were all nodes are friendly. It's interesting that this allows Dynamo to not focus on the problem of data integrity and security. A bit of a separate topic but it reminds me of this Vitalik Buterin quote : "A blockchain, fundamentally, is not a technical improvement. From a technical point of view, blockchains make things less efficient because you need every transaction to be verified a bunch of times, the benefit is social".